
Conversation

@wirthual

Work for #3430

Took the method 1:1 from the OpenAI cookbook. However, the example only covers certain models. How should other models be handled? A quick test with gpt-5 showed a token count that differed from this method.

gpt-5
Warning: gpt-5 may update over time. Returning num tokens assuming gpt-5-2025-08-07.
110 prompt tokens counted by num_tokens_from_messages().
109 prompt tokens counted by the OpenAI API.
Test script
from openai import OpenAI
import os
import tiktoken

def num_tokens_from_messages(messages, model="gpt-4o-mini-2024-07-18"):
    """Return the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using o200k_base encoding.")
        encoding = tiktoken.get_encoding("o200k_base")
    if model in {
        "gpt-3.5-turbo-0125",
        "gpt-4-0314",
        "gpt-4-32k-0314",
        "gpt-4-0613",
        "gpt-4-32k-0613",
        "gpt-4o-mini-2024-07-18",
        "gpt-4o-2024-08-06",
        "gpt-4.1-2025-04-14",
        "gpt-5-2025-08-07",
        }:
        tokens_per_message = 3
        tokens_per_name = 1
    elif "gpt-3.5-turbo" in model:
        print("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0125.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0125")
    elif "gpt-4o-mini" in model:
        print("Warning: gpt-4o-mini may update over time. Returning num tokens assuming gpt-4o-mini-2024-07-18.")
        return num_tokens_from_messages(messages, model="gpt-4o-mini-2024-07-18")
    elif "gpt-4o" in model:
        print("Warning: gpt-4o and gpt-4o-mini may update over time. Returning num tokens assuming gpt-4o-2024-08-06.")
        return num_tokens_from_messages(messages, model="gpt-4o-2024-08-06")
    elif "gpt-4" in model:
        print("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
        return num_tokens_from_messages(messages, model="gpt-4-0613")
    elif "gpt-5" in model:
        print("Warning: gpt-5 may update over time. Returning num tokens assuming gpt-5-2025-08-07.")
        return num_tokens_from_messages(messages, model="gpt-5-2025-08-07")
    else:
        raise NotImplementedError(
            f"""num_tokens_from_messages() is not implemented for model {model}."""
        )
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

example_messages = [
    {
        "role": "system",
        "content": "You are a helpful, pattern-following assistant that translates corporate jargon into plain English.",
    },
    {
        "role": "system",
        "content": "New synergies will help drive top-line growth.",
    },
    {
        "role": "system",
        "content": "Things working well together will increase revenue.",
    },
    {
        "role": "system",
        "content": "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage.",
    },
    {
        "role": "system",
        "content": "Let's talk later when we're less busy about how to do better.",
    },
    {
        "role": "user",
        "content": "This late pivot means we don't have time to boil the ocean for the client deliverable.",
    },
]

for model in [
    "gpt-3.5-turbo",
    "gpt-4",
    "gpt-4o",
    "gpt-4o-mini",
    "gpt-5"
    ]:
    print(model)
    # example token count from the function defined above
    print(f"{num_tokens_from_messages(example_messages, model)} prompt tokens counted by num_tokens_from_messages().")
    # example token count from the OpenAI API
    response = client.chat.completions.create(model=model, messages=example_messages)
    print(f'{response.usage.prompt_tokens} prompt tokens counted by the OpenAI API.')
    print()

return event_loop


def num_tokens_from_messages(
Collaborator

This is OpenAI-specific, so it should live in models/openai.py


def num_tokens_from_messages(
    messages: list[ChatCompletionMessageParam] | list[ResponseInputItemParam],
    model: OpenAIModelName = 'gpt-4o-mini-2024-07-18',
Collaborator

We don't need a default value

else:
    raise NotImplementedError(
        f"""num_tokens_from_messages() is not implemented for model {model}."""
    )  # TODO: How to handle other models?
Collaborator

Are you able to reverse engineer the right formula for gpt-5?

As long as we document that this is a best-effort calculation and may not be accurate down to the exact token, we can have one branch of logic for everything before gpt-5 and one for everything newer. If future models have different rules, we can update the logic then.

Author

I think the calculation for gpt-5 is more accurate with a decreased final primer.

Should the method from the cookbook be the default for all other models?

try:
    encoding = tiktoken.encoding_for_model(model)
except KeyError:
    print('Warning: model not found. Using o200k_base encoding.')  # TODO: How to handle warnings?
Collaborator

No warnings please, let's just make a best effort
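A minimal sketch of that best-effort fallback, with _get_encoding as a placeholder name:

import tiktoken


def _get_encoding(model: str) -> tiktoken.Encoding:
    # Best effort: silently fall back to o200k_base for models tiktoken doesn't know.
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        return tiktoken.get_encoding('o200k_base')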

)


def num_tokens_from_messages(
Collaborator

Please make this a private function

Author

Added the _ prefix.

elif 'gpt-5' in model:
    return num_tokens_from_messages(messages, model='gpt-5-2025-08-07')
else:
    raise NotImplementedError(f"""num_tokens_from_messages() is not implemented for model {model}.""")
Collaborator

Let's simplify all of this as if 'gpt-5' in model: <do the new thing> else: <do the old thing>

Author

So this is executed for all other models except gpt-5?

if 'gpt-5' in model:
    tokens_per_message = 3
    final_primer = 2  # "reverse engineered" based on test cases
else:
    # Adapted from https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken#6-counting-tokens-for-chat-completions-api-calls
    tokens_per_message = 3
    final_primer = 3  # every reply is primed with <|start|>assistant<|message|>

Or only for the models explicitly listed in the cookbook, raising an error for the others?

    'gpt-4o-2024-08-06',
}:
    tokens_per_message = 3
    final_primer = 3  # every reply is primed with <|start|>assistant<|message|>
Collaborator

Let's include a link to the doc we took this from

Author

Link added.

    'gpt-5-2025-08-07',
}:
    tokens_per_message = 3
    final_primer = 2
Collaborator

Let's make it explicit that this one was "reverse engineered"

Author

Added comment

Collaborator

Please test the entire exception message as here:

async def test_anthropic_model_usage_limit_exceeded(
    allow_model_requests: None,
    anthropic_api_key: str,
):
    model = AnthropicModel('claude-sonnet-4-5', provider=AnthropicProvider(api_key=anthropic_api_key))
    agent = Agent(model=model)
    with pytest.raises(
        UsageLimitExceeded,
        match='The next request would exceed the input_tokens_limit of 18 \\(input_tokens=19\\)',
    ):
        await agent.run(
            'The quick brown fox jumps over the lazydog.',
            usage_limits=UsageLimits(input_tokens_limit=18, count_tokens_before_request=True),
        )

Collaborator

This is incorrect :)

Collaborator

We should also have a test for a system other than OpenAI, right? You can get that by building an OpenAIChatModel with a different provider, like OllamaProvider for example.
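A rough sketch of such a test, assuming OllamaProvider is importable from pydantic_ai.providers.ollama and takes a base_url; the model name, limits, and test name are arbitrary:

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.ollama import OllamaProvider
from pydantic_ai.usage import UsageLimits


async def test_ollama_count_tokens_before_request(allow_model_requests: None):
    # Ollama speaks the OpenAI-compatible API, so this goes through OpenAIChatModel
    # and should exercise the same tiktoken-based counting path as the OpenAI tests.
    model = OpenAIChatModel('llama3.2', provider=OllamaProvider(base_url='http://localhost:11434/v1'))
    agent = Agent(model=model)
    await agent.run(
        'The quick brown fox jumps over the lazy dog.',
        usage_limits=UsageLimits(input_tokens_limit=100, count_tokens_before_request=True),
    )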

    tokens_per_message = 3
    final_primer = 2  # "reverse engineered" based on test cases
else:
    # Adapted from https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken#6-counting-tokens-for-chat-completions-api-calls
Collaborator

Looking at the cookbook again, I think we should also try to implement support for counting the tokens of tool definitions:

https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken#7-counting-tokens-for-chat-completions-with-tool-calls
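The cookbook derives exact per-field constants for tool definitions; until those are ported, a very rough placeholder could just encode the serialized schema (the fixed overhead of 7 below is a guess, not taken from the cookbook):

import json


def _rough_num_tokens_for_tools(tools: list[dict], encoding) -> int:
    # Rough approximation: encode each serialized tool definition plus a guessed
    # per-tool overhead, to be verified against real usage data from the API.
    num_tokens = 0
    for tool in tools:
        num_tokens += len(encoding.encode(json.dumps(tool)))
        num_tokens += 7  # guessed fixed overhead per tool definition
    return num_tokens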

Collaborator

Instead of testing against our own hard-coded values, can we compare against the real token usage data returned by the API (that will be recorded in cassettes)?

I'd also like to see tests with more complicated message structures, e.g. with tool calls
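For example, a test along those lines might look like this; example_messages and _num_tokens_from_messages are placeholder names, and the anyio marker is an assumption about the test setup:

import pytest
from openai import AsyncOpenAI


@pytest.mark.anyio
async def test_count_tokens_matches_api_usage(allow_model_requests: None, openai_api_key: str):
    # The request/response pair is recorded in a cassette, so the API-reported
    # prompt_tokens becomes the ground truth for our estimate.
    client = AsyncOpenAI(api_key=openai_api_key)
    response = await client.chat.completions.create(model='gpt-4o-mini', messages=example_messages)
    estimated = _num_tokens_from_messages(example_messages, model='gpt-4o-mini')
    assert abs(estimated - response.usage.prompt_tokens) <= 2  # best-effort tolerance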

Author

Great idea.

Collaborator

Why are these files so large?

100,286 additions

Can we remove them?

Author

The tiktoken library downloads the necessary encoding files, e.g. this one. Usually that's cached in TIKTOKEN_CACHE_DIR.

To allow the tests to run independently, the cassette for each test currently stores this file.

Is there a better way to handle these MB-sized files? Maybe put them in the cache in a fixture?

Collaborator

Ah that makes sense... What do you think about mocking the tiktoken calls, so we can keep these files out of the repo?
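A sketch of what that mocking could look like with a monkeypatch fixture; the fixture name and the fake tokenization rule are arbitrary, so tests would assert against deterministic fake counts rather than real tokenization:

import pytest
import tiktoken


class _FakeEncoding:
    # Stand-in for tiktoken.Encoding: one token per whitespace-separated word.
    def encode(self, text: str) -> list[int]:
        return [0] * len(text.split())


@pytest.fixture
def fake_tiktoken(monkeypatch: pytest.MonkeyPatch):
    fake = _FakeEncoding()
    # Patch both lookup paths so no encoding file is ever downloaded.
    monkeypatch.setattr(tiktoken, 'encoding_for_model', lambda model: fake)
    monkeypatch.setattr(tiktoken, 'get_encoding', lambda name: fake)
    return fake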

num_tokens = 0
for message in messages:
    num_tokens += tokens_per_message
    for value in message.values():
Collaborator

This is a bit weird, as it assumes every string value in the message dict will be sent to the model. That may be the case for ChatCompletionMessageParam, but not for ResponseInputItemParam, which is a union that includes things like Message:

class Message(TypedDict, total=False):
    content: Required[ResponseInputMessageContentListParam]
    """
    A list of one or many input items to the model, containing different content
    types.
    """

    role: Required[Literal["user", "system", "developer"]]
    """The role of the message input. One of `user`, `system`, or `developer`."""

    status: Literal["in_progress", "completed", "incomplete"]
    """The status of item.

    One of `in_progress`, `completed`, or `incomplete`. Populated when items are
    returned via API.
    """

    type: Literal["message"]
    """The type of the message input. Always set to `message`."""

I don't think those status and type fields end up with the model. But it'd be worth verifying by comparing our calculation with real data from the API, as I suggested below in the tests.
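If that's confirmed, one option is to count only a whitelist of keys rather than every string value; a sketch, with the key set being a guess to verify against real usage data:

def _count_message_tokens(messages, encoding, tokens_per_message: int) -> int:
    counted_keys = {'role', 'content', 'name'}  # guess at the fields actually sent to the model
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            if key not in counted_keys:
                continue  # skip bookkeeping fields like 'status' and 'type'
            if isinstance(value, str):
                num_tokens += len(encoding.encode(value))
    return num_tokens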

for message in messages:
    num_tokens += tokens_per_message
    for value in message.values():
        if isinstance(value, str):
Collaborator

We also don't currently handle lists of strings properly, for example ChatCompletionMessageParam can be ChatCompletionUserMessageParam:

class ChatCompletionUserMessageParam(TypedDict, total=False):
    content: Required[Union[str, Iterable[ChatCompletionContentPartParam]]]
    """The contents of the user message."""

    role: Required[Literal["user"]]
    """The role of the messages author, in this case `user`."""

    name: str
    """An optional name for the participant.

    Provides the model information to differentiate between participants of the same
    role.
    """

content may just be a str, but could also be a list of ChatCompletionContentPartTextParam:

class ChatCompletionContentPartTextParam(TypedDict, total=False):
    text: Required[str]
    """The text content."""

    type: Required[Literal["text"]]
    """The type of the content part."""

We shouldn't exclude that text from the count.

Same for ResponseInputItemParam, which can have text hidden inside lists.

Unfortunately OpenAI makes it very hard for us to calculate this stuff correctly, but I'd rather have no count_tokens method than one that only works in very specific unrealistic scenarios -- most users are going to have more complicated message histories than the ones we're currently accounting for. So I think we should either implement some smarter behavior (and verify in the tests that it works!), or "give up". Let me know if you're up for the challenge :)
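For what it's worth, the smarter behavior could recursively walk the payload and encode every text-bearing field instead of only top-level strings; the set of keys checked below is an assumption based on the OpenAI param types quoted above, not a verified list:

def _iter_texts(value):
    # Yield every piece of text we believe reaches the model, wherever it is nested.
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from _iter_texts(item)
    elif isinstance(value, dict):
        for key in ('role', 'content', 'text', 'name', 'arguments', 'tool_calls', 'function'):
            if key in value:
                yield from _iter_texts(value[key])


def _estimate_message_tokens(messages, encoding, tokens_per_message: int) -> int:
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for text in _iter_texts(message):
            num_tokens += len(encoding.encode(text))
    return num_tokens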

Author

Yes, I am up for trying to count the tokens for more complicated histories. It seems there are quite a few possible inputs based on your comment.

Are there any test cases representing a more complicated structure that I could use as a starting point?

Collaborator

@wirthual Not specifically, but if you look at the types in the ModelRequest.parts, UserPromptPart.content and ModelResponse.parts type unions, it's pretty easy to (have AI) build one of each
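A sketch of such a one-of-each history, using the part classes from pydantic_ai.messages (constructor arguments may need adjusting):

from pydantic_ai.messages import (
    ModelRequest,
    ModelResponse,
    SystemPromptPart,
    TextPart,
    ToolCallPart,
    ToolReturnPart,
    UserPromptPart,
)

# One request/response round trip with a tool call, then a final text answer.
complicated_history = [
    ModelRequest(parts=[
        SystemPromptPart(content='You are a weather assistant.'),
        UserPromptPart(content='What is the weather in Utrecht?'),
    ]),
    ModelResponse(parts=[
        ToolCallPart(tool_name='get_weather', args={'city': 'Utrecht'}, tool_call_id='call_1'),
    ]),
    ModelRequest(parts=[
        ToolReturnPart(tool_name='get_weather', content='Sunny, 21C', tool_call_id='call_1'),
    ]),
    ModelResponse(parts=[
        TextPart(content='It is sunny and 21C in Utrecht.'),
    ]),
]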

Collaborator

@wirthual Just found https://github.com/pamelafox/openai-messages-token-helper which may be worth using or looking at for inspiration

@DouweM changed the title from "WIP: Add count_tokens for openAI models" to "Implement OpenAI token counting using tiktoken" on Nov 21, 2025
    num_tokens += tokens_per_message
    for value in message.values():
        if isinstance(value, str):
            num_tokens += len(encoding.encode(value))
Collaborator

Since this (or the get_encoding call further up?) could download a large file, and tiktoken is sync, not async, we should wrap the call that may do a download in _utils.run_in_executor to run it in a thread.
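For illustration, a sketch of the idea using the standard-library asyncio.to_thread in place of the repo's _utils.run_in_executor helper; the async wrapper name is arbitrary:

import asyncio

import tiktoken


async def _get_encoding_async(model: str) -> tiktoken.Encoding:
    # encoding_for_model / get_encoding may download an encoding file on first use,
    # so run the sync lookup in a worker thread instead of blocking the event loop.
    def _lookup() -> tiktoken.Encoding:
        try:
            return tiktoken.encoding_for_model(model)
        except KeyError:
            return tiktoken.get_encoding('o200k_base')

    return await asyncio.to_thread(_lookup)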
